The choice of \(K\) in a \(K\)-Nearest Neighbors (KNN) algorithm significantly affects the quality of your predictions.
Small \(K\) values: When \(K\) is small (e.g., \(K = 1\)), the model is highly sensitive to noise in the training data. This can lead to overfitting, where the model performs well on the training data but poorly on unseen data.
Large \(K\) values: As \(K\) increases, the model becomes more generalized. It considers more neighbors, which can help smooth out noise but may also lead to underfitting, where the model is too simple to capture the underlying patterns in the data.
Optimal \(K\): The optimal value of \(K\) balances bias and variance. It is typically determined through cross-validation, where different \(K\) values are tested, and the one that provides the best performance on validation data is chosen.
In summary, the choice of \(K\) is crucial for the performance of the KNN algorithm. It requires careful tuning to achieve the best balance between overfitting and underfitting.
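As an illustrative sketch of this bias–variance trade-off, the test error can be compared across several values of \(K\) using `class::knn()`. The synthetic data and all variable names here are hypothetical, not part of the analysis below:

```r
# Sketch: compare test error for small vs. large K on synthetic 2-class data
library(class)

set.seed(505)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.8) > 0, "A", "B"))
train_idx <- sample(n, 150)

for (k in c(1, 5, 25, 75)) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], y[train_idx], k = k)
  err  <- mean(pred != y[-train_idx])
  cat(sprintf("K = %2d  test error = %.3f\n", k, err))
}
```

Very small \(K\) typically tracks noise in the training set, while very large \(K\) washes out the decision boundary; the error curve is usually lowest somewhere in between.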
3. Feature Engineering
Create a version of the year column that is a factor (instead of numeric).
Create dummy variables that indicate the presence of “cherry”, “chocolate” and “earth” in the description.
Take care to handle upper and lower case characters.
Create 3 new features that represent the interaction between time and the cherry, chocolate and earth indicators.
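A minimal sketch of these steps, assuming the raw dataframe is `wine` with `description` and `year` columns (these names are assumptions from the surrounding analysis):

```r
# Sketch of the feature-engineering steps above
library(dplyr)
library(stringr)

wino <- wine %>%
  mutate(
    year_f    = factor(year),  # year as a factor instead of numeric
    # lower-case the description so matching is case-insensitive
    cherry    = str_detect(str_to_lower(description), "cherry"),
    chocolate = str_detect(str_to_lower(description), "chocolate"),
    earth     = str_detect(str_to_lower(description), "earth"),
    # interactions between time (year) and each indicator
    year_cherry    = year * cherry,
    year_chocolate = year * chocolate,
    year_earth     = year * earth
  )
```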
4. Preprocessing
We preprocess the dataframe using a BoxCox transformation, centering, and scaling of the numeric features.
We create dummy variables for the year_f column.
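A sketch of this preprocessing with caret, assuming the `wino` dataframe from the previous step:

```r
# Sketch of the preprocessing described above
library(caret)
library(dplyr)

# BoxCox, center, and scale; preProcess() only transforms numeric columns
pre  <- preProcess(wino, method = c("BoxCox", "center", "scale"))
wino <- predict(pre, wino)

# dummy variables for the year_f factor
dmy  <- dummyVars(~ year_f, data = wino)
wino <- bind_cols(select(wino, -year_f),
                  as.data.frame(predict(dmy, newdata = wino)))
```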
5. Running \(K\)NN
Split the dataframe into an 80/20 training and test set
Use Caret to run a \(K\)NN model that uses our engineered features to predict province
Use 5-fold cross-validated subsampling
Allow Caret to try 15 different values for \(K\)
Display the confusion matrix on the test data
set.seed(505)
wine_index <- createDataPartition(wino$province, p = 0.8, list = FALSE)
train <- wino[wine_index, ]
test  <- wino[-wine_index, ]
fit <- train(province ~ .,
             data = train,
             method = "knn",
             tuneLength = 15,
             metric = "Kappa",
             trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(fit, test), factor(test$province))
Confusion Matrix and Statistics
Reference
Prediction Burgundy California Casablanca_Valley Marlborough New_York
Burgundy 103 22 4 6 0
California 69 656 10 18 9
Casablanca_Valley 0 0 0 0 0
Marlborough 0 2 0 0 0
New_York 1 0 3 1 1
Oregon 65 111 9 20 16
Reference
Prediction Oregon
Burgundy 31
California 237
Casablanca_Valley 0
Marlborough 0
New_York 1
Oregon 278
Overall Statistics
Accuracy : 0.6204
95% CI : (0.5967, 0.6438)
No Information Rate : 0.4728
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3736
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Burgundy Class: California Class: Casablanca_Valley
Sensitivity 0.43277 0.8293 0.00000
Specificity 0.95610 0.6111 1.00000
Pos Pred Value 0.62048 0.6567 NaN
Neg Pred Value 0.91042 0.7997 0.98446
Prevalence 0.14226 0.4728 0.01554
Detection Rate 0.06157 0.3921 0.00000
Detection Prevalence 0.09922 0.5971 0.00000
Balanced Accuracy 0.69444 0.7202 0.50000
Class: Marlborough Class: New_York Class: Oregon
Sensitivity 0.000000 0.0384615 0.5082
Specificity 0.998771 0.9963570 0.8037
Pos Pred Value 0.000000 0.1428571 0.5571
Neg Pred Value 0.973070 0.9849940 0.7709
Prevalence 0.026898 0.0155409 0.3270
Detection Rate 0.000000 0.0005977 0.1662
Detection Prevalence 0.001195 0.0041841 0.2983
Balanced Accuracy 0.499386 0.5174093 0.6560
We set the seed for reproducibility.
We split the data into training and test sets.
We use the caret package to train a \(K\)NN model with 5-fold cross-validation.
We allow the model to try 15 different values for \(K\).
We display the confusion matrix to evaluate the model’s performance on the test data.
6. Kappa
How do we determine whether a Kappa value represents a good, bad or some other outcome?
The Kappa statistic is a measure of inter-rater agreement or classification accuracy that takes into account the possibility of agreement occurring by chance. Here are the general guidelines for interpreting Kappa values:
Kappa < 0: No agreement
Kappa = 0 - 0.20: Slight agreement
Kappa = 0.21 - 0.40: Fair agreement
Kappa = 0.41 - 0.60: Moderate agreement
Kappa = 0.61 - 0.80: Substantial agreement
Kappa = 0.81 - 1.00: Almost perfect agreement
These guidelines help determine whether the agreement between the predicted and actual classifications is good, bad, or somewhere in between.
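Applied to the model above, the reported Kappa of 0.3736 falls in the "fair agreement" band. Cohen's Kappa is defined as \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed agreement (accuracy) and \(p_e\) is the agreement expected by chance from the row and column marginals. A minimal R sketch that recomputes it from the confusion matrix above:

```r
# Recompute Kappa from the confusion matrix reported above
# (rows = predictions, columns = reference)
cm <- matrix(c(103,  22,  4,  6,  0,  31,
                69, 656, 10, 18,  9, 237,
                 0,   0,  0,  0,  0,   0,
                 0,   2,  0,  0,  0,   0,
                 1,   0,  3,  1,  1,   1,
                65, 111,  9, 20, 16, 278),
             nrow = 6, byrow = TRUE)

n     <- sum(cm)
p_o   <- sum(diag(cm)) / n                     # observed agreement (accuracy)
p_e   <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa <- (p_o - p_e) / (1 - p_e)
round(kappa, 4)                                # 0.3736, "fair" agreement
```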
7. Improvement
How can we interpret the confusion matrix, and how can we improve in our predictions?
Interpreting the Confusion Matrix
The confusion matrix is a table that is used to evaluate the performance of a classification model. It compares the actual target values with the values predicted by the model. Here is how you can interpret the confusion matrix:
True Positives (TP): The number of instances correctly predicted as the positive class.
True Negatives (TN): The number of instances correctly predicted as the negative class.
False Positives (FP): The number of instances incorrectly predicted as the positive class (Type I error).
False Negatives (FN): The number of instances incorrectly predicted as the negative class (Type II error).
From the confusion matrix, you can derive several important metrics:
Accuracy: The proportion of the total number of predictions that were correct.
Kappa: A measure of inter-rater agreement that takes into account the possibility of agreement occurring by chance.
Improving Predictions
To improve the predictions of your model, consider the following strategies:
Feature Engineering: Create new features or transform existing ones to better capture the underlying patterns in the data. For example, you can create interaction terms, polynomial features, or use domain knowledge to create meaningful features.
Hyperparameter Tuning: Experiment with different hyperparameters to find the optimal settings for your model. For \(K\)NN, this includes trying different values of \(K\).
Cross-Validation: Use more folds or different cross-validation techniques to ensure the model generalizes well to unseen data. This helps in selecting the best model and avoiding overfitting.
Ensemble Methods: Combine multiple models to improve prediction accuracy and robustness. Techniques like bagging, boosting, and stacking can help in creating a more robust model.
Data Augmentation: If you have limited data, consider techniques to augment your dataset. This can include generating synthetic data, oversampling minority classes, or using data augmentation techniques specific to your domain.
Regularization: Apply regularization techniques to prevent overfitting. Regularization methods like L1 (Lasso) and L2 (Ridge) can help in creating a more generalized model.
Model Selection: Try different types of models to see which one performs best on your data. For example, you can compare \(K\)NN with decision trees, random forests, support vector machines, or neural networks.
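One concrete way to act on several of these suggestions here: the confusion matrix above shows that Casablanca_Valley and Marlborough are almost never predicted, so class imbalance is a plausible culprit. A hedged sketch that up-samples the minority provinces with caret's `upSample()` before refitting (object names follow the code above; whether this improves Kappa would need to be verified on the test set):

```r
# Sketch: balance the training set by up-sampling minority classes,
# then refit the same KNN model
library(caret)
library(dplyr)

up_train <- upSample(x = select(train, -province),
                     y = train$province,
                     yname = "province")

fit_up <- train(province ~ .,
                data = up_train,
                method = "knn",
                tuneLength = 15,
                metric = "Kappa",
                trControl = trainControl(method = "cv", number = 5))
```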